

This repo is divided into two parts, creation/ (1) and analysis/ (2), described below after the general requirements.

All R scripts can be run using R 4.4.2 and require the environment defined in libs.R. All .py scripts should be run in a Spark context (specification below). All scripts were originally run on large cluster nodes (an estimated 1 TB of RAM). On smaller hardware, the linkage scripts should be salted so that heavily populated join keys are split across more partitions.
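Salting is not spelled out further in this archive. As a minimal illustration, assuming it refers to the usual Spark practice of splitting a heavily populated join key into several sub-keys, the pure-Python sketch below shows the idea. It is not taken from the linkage scripts; salted_join and n_salts are hypothetical names.

```python
import random
from collections import defaultdict


def salted_join(left, right, n_salts, seed=0):
    """Inner-join two lists of (key, value) pairs using key salting.

    Illustrative only: in Spark each (key, salt) bucket would be handled by
    a separate task, so one very common key no longer lands in one partition.
    """
    rng = random.Random(seed)

    # Left side: give every row a random salt, spreading a hot key
    # over up to n_salts smaller buckets.
    left_buckets = defaultdict(list)
    for key, value in left:
        left_buckets[(key, rng.randrange(n_salts))].append(value)

    # Right side: replicate every row once per salt value, so each salted
    # left bucket still meets all of its matching right rows.
    right_buckets = defaultdict(list)
    for key, value in right:
        for salt in range(n_salts):
            right_buckets[(key, salt)].append(value)

    # Join bucket by bucket; the result equals the unsalted join.
    joined = []
    for (key, salt), left_values in left_buckets.items():
        for left_value in left_values:
            for right_value in right_buckets.get((key, salt), []):
                joined.append((key, left_value, right_value))
    return joined
```

The trade-off is that the replicated side grows by a factor of n_salts, so the salt count should stay small relative to the available memory.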

1) 

creation/ provides the scripts necessary to generate our analysis dataset, beginning from the raw data sources (L2, CoreLogic, and FEC). Scripts should be run in the following order from the creation/ root. Scripts in the same subfolder with the same numeric prefix can be run concurrently.

- The script in gender/ may optionally be run to produce gender/names_gender.rds
- Run the script in l2/
	- l2_inventory.csv provides names of raw files that can be obtained from L2
- Scripts in cl/ must be run in order.
	- cl_inventory.csv provides names of raw files that can be obtained from CoreLogic
	- 03_ scripts must be run as an array job, passing 1 through 51 as integer arguments.
- Scripts in fec/ must be run in order.
	- 01_to_arrow.R assumes access to a restored FEC Schedule A Postgres mirror. Details here: https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/index.html?prefix=bulk-downloads/data-dump/schedules/
	- 03_ scripts must be run as an array job, passing 1 through 7 as integer arguments.
- Scripts in cl_fec/, cl_l2/, and fec_l2/ can all run concurrently, but each must be run as an array job that takes 1 through 51 as its first integer argument and one of 2012, 2016, or 2020 as its second integer argument.
- Run all scripts in joining/
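
The two-argument array jobs for cl_fec/, cl_l2/, and fec_l2/ cover every combination of the 51 indices and the three years. The sketch below shows one way to enumerate those combinations and map a single 1-based task ID to an argument pair. It assumes a scheduler that supplies such an ID (for example Slurm's SLURM_ARRAY_TASK_ID); the scheduler and the names linkage_tasks and task_args are assumptions, not part of this archive.

```python
from itertools import product

INDICES = range(1, 52)      # first argument: 1 through 51
YEARS = (2012, 2016, 2020)  # second argument


def linkage_tasks():
    """Return every (index, year) pair, ordered by index and then year."""
    return list(product(INDICES, YEARS))


def task_args(task_id):
    """Map a 1-based array task ID to its (index, year) argument pair."""
    tasks = linkage_tasks()
    if not 1 <= task_id <= len(tasks):
        raise ValueError(f"task_id must be between 1 and {len(tasks)}")
    return tasks[task_id - 1]
```

With this layout each of the three linkage folders needs 153 array tasks (51 indices times 3 years).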

2) 

analysis/ provides all scripts that generate, from the merged analysis dataset, all tables, figures, and reported quantities. analysis/make_dataset.R should be run first to generate .parquet files. Scripts in analysis/make_summary generate the correspondingly named summary_data/ files and can be run in any order, as long as the user has access to the underlying analysis dataset. analysis/text_stats.R produces the quantities reported outside of tables and figures.

Reported tables and figures can be generated from this replication archive by running all scripts in analysis/make_contents in any order. (The underlying data for Fig A2 is excluded. Please see the Data Availability Statement.)

===================
PYTHON/SPARK SETUP
===================

Spark 3.2 (PySpark on Python 3.10)
Splink 3.0.1 on Python 3.10 using the Splink JAR scala-udf-similarity-0.0.9.jar
We also required the GraphFrames JAR graphframes:0.8.2-spark3.2-s_2.12
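
As a rough sketch of how the two JARs might be attached to a PySpark session: the settings spark.jars and spark.jars.packages are standard Spark configuration keys, but the local JAR path, the full Maven coordinate (with a graphframes group ID in front of the version string given above), and the helper names are assumptions rather than the configuration used for this archive.

```python
def spark_conf(similarity_jar="scala-udf-similarity-0.0.9.jar"):
    """Return Spark settings that attach the Splink and GraphFrames JARs.

    similarity_jar is a hypothetical local path to the Splink JAR.
    """
    return {
        # Local JAR providing Splink's string-similarity UDFs.
        "spark.jars": similarity_jar,
        # Assumed full coordinate for the GraphFrames package.
        "spark.jars.packages": "graphframes:graphframes:0.8.2-spark3.2-s_2.12",
    }


def build_session(app_name="linkage"):
    """Create a SparkSession with the settings above (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily so spark_conf() works without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session() requires a working Spark 3.2 installation and the Splink JAR at the given path; spark_conf() alone can be used to inspect the settings.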

Full Python (Conda) environment below:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
altair                    4.2.0                    pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
blas                      1.0                         mkl  
bottleneck                1.3.5           py310ha9d4c09_0  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2022.4.26            h06a4308_0  
certifi                   2022.6.15       py310h06a4308_0  
cramjam                   2.5.0                    pypi_0    pypi
duckdb                    0.4.0                    pypi_0    pypi
entrypoints               0.4                      pypi_0    pypi
fastparquet               0.8.1                    pypi_0    pypi
fsspec                    2022.5.0                 pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561  
jinja2                    3.1.2                    pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.0.3                h7f8727e_2  
markupsafe                2.1.1                    pypi_0    pypi
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0           py310h7f8727e_0  
mkl_fft                   1.3.1           py310hd6ae3a3_0  
mkl_random                1.2.2           py310h00e6091_0  
ncurses                   6.3                  h5eee18b_3  
numexpr                   2.8.3           py310hcea2de6_0  
numpy                     1.22.3          py310hfa59a62_0  
numpy-base                1.22.3          py310h9585f30_0  
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3               pyhd3eb1b0_0  
pandas                    1.4.3           py310h6a678d5_0  
pip                       22.1.2          py310h06a4308_0  
pyarrow                   8.0.0                    pypi_0    pypi
pyparsing                 3.0.4              pyhd3eb1b0_0  
pyrsistent                0.18.1                   pypi_0    pypi
python                    3.10.4               h12debd9_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pytz                      2022.1          py310h06a4308_0  
readline                  8.1.2                h7f8727e_1  
setuptools                61.2.0          py310h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
splink                    3.0.1                    pypi_0    pypi
sqlglot                   4.1.1                    pypi_0    pypi
sqlite                    3.38.5               hc218d9a_0  
tk                        8.6.12               h1ccaba5_0  
toolz                     0.12.0                   pypi_0    pypi
tzdata                    2022a                hda174b7_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.5                h7f8727e_1  
zlib                      1.2.12               h7f8727e_2  

 

===================
CoreLogic/L2 Files
===================


The L2 and CoreLogic files necessary to run the code are proprietary, but can be obtained by researchers directly from the vendors (https://www.l2-data.com/ for L2 data, https://www.corelogic.com/360-property-data/ for CoreLogic). We obtained the data via subscriptions held by Princeton University.

The file versions and names are listed in the following places. Each set of files should be stored in the same directory as the script that cleans it:

L2 - creation/l2/l2_inventory.csv 

CoreLogic - creation/cl/cl_inventory.csv

The version of each L2 file should be clear from the information in l2_inventory.csv. For CoreLogic, we use three separate snapshots of the full national files, each pulled on August 2, 2022 (seen in the "20220802" part of the file names). These were rolling snapshots that correspond roughly to data compiled through 2020 (the "_01_" file), 2016 ("_06_"), and 2012 ("_10_"). Researchers can request the specific files with these markers from CoreLogic.
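
The marker convention above can be checked mechanically. The sketch below maps a CoreLogic file name to its approximate year. It assumes only that the name contains the pull date and exactly one of the three markers; the example file names in the usage note are hypothetical, and the real names are those in cl_inventory.csv.

```python
SNAPSHOT_DATE = "20220802"  # pull date embedded in the file names
MARKER_TO_YEAR = {"_01_": 2020, "_06_": 2016, "_10_": 2012}


def snapshot_year(file_name):
    """Return the approximate year that a CoreLogic snapshot file corresponds to."""
    if SNAPSHOT_DATE not in file_name:
        raise ValueError(f"{file_name!r} does not contain {SNAPSHOT_DATE}")
    # Collect every marker found; a valid name should contain exactly one.
    years = [year for marker, year in MARKER_TO_YEAR.items() if marker in file_name]
    if len(years) != 1:
        raise ValueError(f"{file_name!r} must contain exactly one snapshot marker")
    return years[0]
```

For instance, snapshot_year("example_20220802_06_file.txt") returns 2016, while a name lacking the pull date raises ValueError.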
